Uploaded image for project: 'Apache Storm'
  1. Apache Storm
  2. STORM-2176

Workers do not shutdown cleanly and worker hooks don't run when a topology is killed

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.0.0, 1.0.1, 1.0.2
    • 2.0.0, 1.1.0, 1.0.3
    • None
    • None

    Description

      This appears to have been introduced in the 1.0.0 release. The issues does not seem to affect 0.10.2.

      When a topology is killed and workers receive the notification to shutdown, they do not shutdown cleanly, so worker hooks never get invoked.

      When a worker shuts down cleanly, the worker logs should contain entries such as the following:

      2016-10-28 18:52:06.273 b.s.d.worker [INFO] Shut down transfer thread
      2016-10-28 18:52:06.279 b.s.d.worker [INFO] Shutting down default resources
      2016-10-28 18:52:06.287 b.s.d.worker [INFO] Shut down default resources
      2016-10-28 18:52:06.351 b.s.d.worker [INFO] Disconnecting from storm cluster state context
      2016-10-28 18:52:06.359 b.s.d.worker [INFO] Shut down worker exclaim-1-1477680593 61bddd66-0fda-4556-b742-4b63f0df6fc1 6700
      

      In the 1.0.x line of releases (and presumably 1.x, though I haven't checked) this does not happen – the worker shutdown process appears to get stuck shutting down executors (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/daemon/worker.clj#L666), no further log messages are seen in the worker log, and worker hooks do not run.

      There are two properties that affect how workers exit. The first is the configuration property supervisor.worker.shutdown.sleep.secs, which defaults to 1 second. This corresponds to how long the supervisor will wait for a worker to exit gracefully before forcibly killing it with kill -9. When this happens the supervisor will log that the worker terminated with exit code 137 (128 + 9).

      The second property is a hard-coded 1 second delay (https://github.com/apache/storm/blob/v1.0.2/storm-core/src/clj/org/apache/storm/util.clj#L463) added as a shutdown hook that will call Runtime.halt() if the delay is exceeded. When this happens, the supervisor will log that the worker terminated with exit code 20 (hard-coded).

      Side Note: The hardcoded halt delay in worker.clj and the default value for supervisor.worker.shutdown.sleep.secs both being 1 second should probably be changed since it creates a race to see whether the supervisor delay or the worker delay wins.

      To test this, I set supervisor.worker.shutdown.sleep.secs to 15 to allow plenty of time for the worker to exit gracefully, and deployed and killed a topology. In this case the supervisor consistently reported exit code 20 for the worker, indicating the hard-coded shutdown hook caused the worker to exit.

      I thought the hard-coded 1 second shutdown hook delay might not be long enough for the worker to shutdown cleanly. To test that hypothesis, I changed the hard-code delay to 10 seconds, leaving supervisor.worker.shutdown.sleep.secs at 15 seconds. Again supervisor reported an exit code of 20 for the worker, and there were no log messages indicating the worker had exited cleanly and that the worker hook had run.

      Attachments

        Issue Links

          Activity

            People

              kabhwan Jungtaek Lim
              ptgoetz P. Taylor Goetz
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h
                  1h